What is the intended goal?
Define customer clusters based on the delivered data, then identify the key features used to group the customers.
What characteristics should the customers in a cluster have in common?
Identify the customer clusters.
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook will give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.
In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they wish to explore different techniques.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To compute distances
from scipy.spatial.distance import cdist
# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer
# Importing PCA
from sklearn.decomposition import PCA
# To encode the variable
from sklearn.preprocessing import LabelEncoder
# Importing TSNE
from sklearn.manifold import TSNE
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To compute distances
from scipy.spatial.distance import pdist
# To import K-Medoids
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import DBSCAN
# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Library to extract datetime features
from datetime import datetime
from datetime import timedelta
# Constants
# Note: the problem statement says the data can be assumed to have been
# collected in 2016, so year_today = 2016 could also be used here as the
# reference year instead of the current year.
year_today = datetime.today().year
# loading the dataset
data_original = pd.read_csv("F:/DataScienceAI/capstone project/marketing_campaign.csv")
data = data_original.copy()
# Print the shape of the data
data_shape = data.shape
print(data_shape)
(2240, 27)
We have 2,240 rows and 27 columns.
# View first 5 rows
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
# View last 5 rows Hint: Use tail() method
data.tail()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
Non-numeric variables: Education, Marital_Status, and Dt_Customer. Dt_Customer is a date in dd-mm-yyyy format.
# Check the datatypes of each column. Hint: Use info() method
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 2240.0 | 5592.159821 | 3246.662198 | 0.0 | 2828.25 | 5458.5 | 8427.75 | 11191.0 |
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
# We have null values in the Income column
# Find the percentage of missing values in the Income column
percentage_missing = 100 - (data['Income'].count() / len(data) * 100)
print(percentage_missing)
1.0714285714285694
# replacing missing values
data['Income'] = np.where(data['Income'].isnull() , data['Income'].mean(), data['Income'])
Observations:
Income has missing values: about 1.1% of the rows. The null values were replaced by the mean of the column.
We have 27 columns: 3 object, 1 float, and 23 int.
Year_Birth: the minimum value is 1893, which is inconsistent data. Format: %Y.
Dt_Customer: object, in dd-mm-yyyy format.
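Since Income is right-skewed (the maximum of 666,666 sits far above the mean of about 52,000), the median is a more robust fill value than the mean. A minimal sketch of that alternative on toy data (the values below are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy income series with a missing value and one large outlier,
# mimicking the skew of the real Income column (values are illustrative).
income = pd.Series([30000.0, 45000.0, 52000.0, np.nan, 666666.0])

mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())

# The outlier drags the mean far above the typical customer,
# while the median stays close to the bulk of the data.
print(income.mean())    # 198416.5 -- inflated by the outlier
print(income.median())  # 48500.0
```

With the real column this would be `data['Income'].fillna(data['Income'].median())`.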
We can observe that ID has no null values. Also the number of unique values are equal to the number of observations. So, ID looks like an index for the data entry and such a column would not be useful in providing any predictive power for our analysis. Hence, it can be dropped.
Dropping the ID column
# Remove ID column from data. Hint: Use inplace = True
data.drop(['ID'], axis =1, inplace = True)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2240 non-null int64 1 Education 2240 non-null object 2 Marital_Status 2240 non-null object 3 Income 2240 non-null float64 4 Kidhome 2240 non-null int64 5 Teenhome 2240 non-null int64 6 Dt_Customer 2240 non-null object 7 Recency 2240 non-null int64 8 MntWines 2240 non-null int64 9 MntFruits 2240 non-null int64 10 MntMeatProducts 2240 non-null int64 11 MntFishProducts 2240 non-null int64 12 MntSweetProducts 2240 non-null int64 13 MntGoldProds 2240 non-null int64 14 NumDealsPurchases 2240 non-null int64 15 NumWebPurchases 2240 non-null int64 16 NumCatalogPurchases 2240 non-null int64 17 NumStorePurchases 2240 non-null int64 18 NumWebVisitsMonth 2240 non-null int64 19 AcceptedCmp3 2240 non-null int64 20 AcceptedCmp4 2240 non-null int64 21 AcceptedCmp5 2240 non-null int64 22 AcceptedCmp1 2240 non-null int64 23 AcceptedCmp2 2240 non-null int64 24 Complain 2240 non-null int64 25 Response 2240 non-null int64 dtypes: float64(1), int64(22), object(3) memory usage: 455.1+ KB
# Explore basic summary statistics of numeric variables. Hint: Use describe() method.
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2240.0 | 52247.251354 | 25037.797168 | 1730.0 | 35538.75 | 51741.5 | 68289.75 | 666666.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
Income: the standard deviation is high, so we should review outliers (the max value, for instance). 75% of values are below 68,289.75.
Kidhome & Teenhome: homogeneous, categorical data (0, 1, or 2).
# List of the categorical columns in the data
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome", "Complain"]
# Number of unique observations in each category
for column in cols:
    print("Unique values in", column, "are :")
    print(data[column].unique())
    print("*" * 50)
Unique values in Education are : ['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle'] ************************************************** Unique values in Marital_Status are : ['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO'] ************************************************** Unique values in Kidhome are : [0 1 2] ************************************************** Unique values in Teenhome are : [0 1 2] ************************************************** Unique values in Complain are : [0 1] **************************************************
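Before merging categories, it can help to check how rare the odd Marital_Status labels actually are; in the notebook this would be `data['Marital_Status'].value_counts()`. A self-contained sketch on illustrative data:

```python
import pandas as pd

# Illustrative labels; in the real data 'Alone', 'Absurd', and 'YOLO'
# each cover only a handful of customers.
marital = pd.Series(["Married", "Together", "Single", "Married", "Together",
                     "Single", "Divorced", "Divorced", "YOLO", "Absurd",
                     "Alone", "Married"])

counts = marital.value_counts()
print(counts)

# Labels this rare add noise rather than signal, which is why grouping
# them into 'Single' (or a binary relationship flag) is reasonable.
rare = counts[counts < 2].index.tolist()
print(sorted(rare))  # ['Absurd', 'Alone', 'YOLO']
```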
Think About It:
# Group "2n Cycle" and "PhD" under the category "Master" (both are studies after the bachelor's)
data['Education'] = np.where(data['Education'] == '2n Cycle', 'Master', data['Education'])
data['Education'] = np.where(data['Education'] == 'PhD', 'Master', data['Education'])
# replacing strings by numbers:
# Basic = 0
# Graduation = 1
# Master = 2
data['Education'] = np.where(data['Education'] == 'Basic', 0, data['Education']) # Hint: Use the replace() method and inplace=True
data['Education'] = np.where(data['Education'] == 'Graduation', 1, data['Education']) # Hint: Use the replace() method and inplace=True
data['Education'] = np.where(data['Education'] == 'Master', 2, data['Education']) # Hint: Use the replace() method and inplace=True
data['Education'] = data['Education'].astype('int64')
# Replace the categories "Alone", "Absurd", "YOLO" with the category "Single"
# "Together" and "Married" can be grouped into the same category.
# Single = 0 (also includes Widow and Divorced)
# Relationship = 1
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Alone', 0, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Absurd', 0, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'YOLO', 0, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Together', 1, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Single', 0, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Married', 1, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Widow', 0, data['Marital_Status'])
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Divorced', 0, data['Marital_Status'])
data['Marital_Status'] = data['Marital_Status'].astype('int64')
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2240 non-null int64 1 Education 2240 non-null int64 2 Marital_Status 2240 non-null int64 3 Income 2240 non-null float64 4 Kidhome 2240 non-null int64 5 Teenhome 2240 non-null int64 6 Dt_Customer 2240 non-null object 7 Recency 2240 non-null int64 8 MntWines 2240 non-null int64 9 MntFruits 2240 non-null int64 10 MntMeatProducts 2240 non-null int64 11 MntFishProducts 2240 non-null int64 12 MntSweetProducts 2240 non-null int64 13 MntGoldProds 2240 non-null int64 14 NumDealsPurchases 2240 non-null int64 15 NumWebPurchases 2240 non-null int64 16 NumCatalogPurchases 2240 non-null int64 17 NumStorePurchases 2240 non-null int64 18 NumWebVisitsMonth 2240 non-null int64 19 AcceptedCmp3 2240 non-null int64 20 AcceptedCmp4 2240 non-null int64 21 AcceptedCmp5 2240 non-null int64 22 AcceptedCmp1 2240 non-null int64 23 AcceptedCmp2 2240 non-null int64 24 Complain 2240 non-null int64 25 Response 2240 non-null int64 dtypes: float64(1), int64(24), object(1) memory usage: 455.1+ KB
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
Histograms help to visualize and describe numerical data. We can also use other plots like box plot to analyze the numerical columns.
# Create histograms for all features
for col in data.columns:
    print(col)
    plt.figure(figsize=(20, 7))
    sns.histplot(x=data[col], data=data)
    plt.show()
Year_Birth
Education
Marital_Status
Income
Kidhome
Teenhome
Dt_Customer
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response
Observations:
Year_Birth: clearly, we have inconsistent data. This column will be transformed into Age.
Education: already normalized.
Marital_Status: already normalized.
Income: outliers to review and remove.
The features MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, and MntGoldProds share the same unit and the same behavior, so they can be summarized into a single feature.
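The summarization suggested above can be done by summing the six spend columns into one total-spend feature. A minimal sketch on a toy frame (the column name `Expenses` is just one possible choice):

```python
import pandas as pd

# Toy frame with the six spend columns (two illustrative rows).
df = pd.DataFrame({
    "MntWines": [635, 11],
    "MntFruits": [88, 1],
    "MntMeatProducts": [546, 6],
    "MntFishProducts": [172, 2],
    "MntSweetProducts": [88, 1],
    "MntGoldProds": [88, 6],
})

spend_cols = ["MntWines", "MntFruits", "MntMeatProducts",
              "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

# Row-wise sum across the spend columns gives total spend per customer.
df["Expenses"] = df[spend_cols].sum(axis=1)
print(df["Expenses"].tolist())  # [1617, 27]
```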
# Plot the boxplot
for col in data.columns:
    print(col)
    try:
        sns.boxplot(data=data, x=data[col], showmeans=True, color="green")
        plt.show()
    except:
        print('non-numeric feature')
Year_Birth
Education
Marital_Status
Income
Kidhome
Teenhome
Dt_Customer non numeric feature Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response
Year_Birth: outliers to remove; inconsistent data.
Income: there are outliers in this column; one value is over 600,000, far from the mean, as seen in the describe() output.
MntWines: some outliers above 1,200, but some bottles could simply be expensive.
Think About It
# Calculating the upper whisker for the Income variable
Q1 = data['Income'].quantile(q=0.25)  # Finding the first quartile
Q3 = data['Income'].quantile(q=0.75)  # Finding the third quartile
IQR = Q3 - Q1                         # Finding the Inter Quartile Range
upper_whisker = Q3 + 1.5 * IQR        # Calculating the Upper Whisker for the Income variable
print(upper_whisker) # Printing Upper Whisker
117416.25
# Let's check the observations with extreme value for the Income variable
index = data.index
ids = index[data.Income > upper_whisker]
print(ids)
print (len(ids))
Int64Index([164, 617, 655, 687, 1300, 1653, 2132, 2233], dtype='int64') 8
# Dropping observations identified as outliers
# Pass the indices of the observations (separated by a comma) to drop them
data.drop(ids, inplace=True)
# Let's check the observations with extreme values for the Year_Birth variable
# We have 3 records with Year_Birth earlier than 1940 (1893, 1899, and 1900); this data is inconsistent.
# Removing these outliers.
index = data.index
ids = index[data.Year_Birth < 1940 ]
data.drop(ids,inplace=True)
data['Age'] = year_today - data['Year_Birth']
data['kids'] = data['Kidhome'] + data['Teenhome']
data = data.drop(['Year_Birth'], axis =1)
# In the same way as Age from Year_Birth, we can calculate the customer tenure in days (ageing_customer) based on Dt_Customer.
data["ageing_customer"] = datetime.today() - pd.to_datetime(data['Dt_Customer'], format="%d-%m-%Y")
data["ageing_customer"] = data["ageing_customer"].dt.days
# now we can remove Dt_Customer.
data = data.drop(['Dt_Customer'], axis =1)
data.head()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 65 | 0 | 3684 |
| 1 | 1 | 0 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | 2 | 3134 |
| 2 | 1 | 1 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57 | 0 | 3333 |
| 3 | 1 | 1 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38 | 1 | 3160 |
| 4 | 2 | 1 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 1 | 3182 |
5 rows × 27 columns
Now, let's check the distribution of the Income variable after dropping outliers.
# Plot histograms for all columns
for col in data.columns:
    print(col)
    sns.histplot(data=data, x=data[col], color="violet")
    plt.show()
Education
Marital_Status
Income
Kidhome
Teenhome
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response
Age
kids
ageing_customer
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2229 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Education 2229 non-null int64 1 Marital_Status 2229 non-null int64 2 Income 2229 non-null float64 3 Kidhome 2229 non-null int64 4 Teenhome 2229 non-null int64 5 Recency 2229 non-null int64 6 MntWines 2229 non-null int64 7 MntFruits 2229 non-null int64 8 MntMeatProducts 2229 non-null int64 9 MntFishProducts 2229 non-null int64 10 MntSweetProducts 2229 non-null int64 11 MntGoldProds 2229 non-null int64 12 NumDealsPurchases 2229 non-null int64 13 NumWebPurchases 2229 non-null int64 14 NumCatalogPurchases 2229 non-null int64 15 NumStorePurchases 2229 non-null int64 16 NumWebVisitsMonth 2229 non-null int64 17 AcceptedCmp3 2229 non-null int64 18 AcceptedCmp4 2229 non-null int64 19 AcceptedCmp5 2229 non-null int64 20 AcceptedCmp1 2229 non-null int64 21 AcceptedCmp2 2229 non-null int64 22 Complain 2229 non-null int64 23 Response 2229 non-null int64 24 Age 2229 non-null int64 25 kids 2229 non-null int64 26 ageing_customer 2229 non-null int64 dtypes: float64(1), int64(26) memory usage: 552.1 KB
# Plot the boxplot for all columns
for col in data.columns:
    print(col)
    sns.boxplot(data=data, x=data[col], showmeans=True, color="red")
    plt.show()
Education
Marital_Status
Income
Kidhome
Teenhome
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response
Age
kids
ageing_customer
Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.
def perc_on_bar(z):
    '''
    Plot a bar chart for a categorical column with the percentage of each
    category annotated on top of its bar.
    z: name of the categorical column
    Note: the function won't work if a column is passed to the hue parameter
    '''
    total = len(data[z])  # Length of the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(x=data[z], palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()            # y position of the annotation
        ax.annotate(percentage, (x, y), size=12)  # Annotate the percentage
    plt.show()  # Show the plot
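The annotation logic in `perc_on_bar` boils down to dividing each category's count by the column length. A plotting-free sketch of the same computation on illustrative data:

```python
import pandas as pd

# Toy categorical column standing in for e.g. data['Education'].
col = pd.Series(["Graduation", "Master", "Graduation",
                 "Basic", "Graduation", "Master"])

# The same quantity perc_on_bar annotates on each bar:
percentages = (col.value_counts() / len(col) * 100).round(1)
print(percentages.to_dict())  # {'Graduation': 50.0, 'Master': 33.3, 'Basic': 16.7}
```

In the notebook, calling `perc_on_bar("Education")` would render the equivalent bar plot.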
We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.
Heat map can show a 2D correlation matrix between numerical features.
corr_matrix =data.iloc[ : ,0 : 26].corr()
corr_matrix.T
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Education | 1.000000 | 0.014989 | 0.119401 | -0.031041 | 0.088048 | -0.028680 | 0.133817 | -0.070698 | 0.006975 | -0.061636 | ... | -0.041767 | 0.000293 | 0.035424 | 0.011249 | -0.008086 | 0.000209 | -0.025773 | 0.066127 | 0.123008 | 0.041548 |
| Marital_Status | 0.014989 | 1.000000 | -0.012374 | 0.023602 | 0.031609 | -0.002807 | -0.006189 | -0.025400 | -0.028011 | -0.017265 | ... | 0.006227 | -0.018034 | -0.005650 | 0.016727 | 0.008502 | 0.002623 | 0.001146 | -0.150114 | 0.001459 | 0.039804 |
| Income | 0.119401 | -0.012374 | 1.000000 | -0.527213 | 0.042271 | 0.006777 | 0.728021 | 0.535246 | 0.693754 | 0.549958 | ... | -0.644450 | -0.011268 | 0.231291 | 0.414982 | 0.342897 | 0.104420 | -0.027517 | 0.174419 | 0.211486 | -0.347018 |
| Kidhome | -0.031041 | 0.023602 | -0.527213 | 1.000000 | -0.036461 | 0.009144 | -0.498230 | -0.373563 | -0.443248 | -0.388645 | ... | 0.448826 | 0.014500 | -0.162170 | -0.205456 | -0.172910 | -0.080084 | 0.036294 | -0.080514 | -0.234469 | 0.689881 |
| Teenhome | 0.088048 | 0.031609 | 0.042271 | -0.036461 | 1.000000 | 0.016246 | 0.003899 | -0.177128 | -0.267046 | -0.205301 | ... | 0.133056 | -0.043369 | 0.038374 | -0.191018 | -0.140964 | -0.012465 | 0.007442 | -0.155726 | 0.364148 | 0.698287 |
| Recency | -0.028680 | -0.002807 | 0.006777 | 0.009144 | 0.016246 | 1.000000 | 0.016797 | -0.003376 | 0.026651 | 0.001694 | ... | -0.020849 | -0.033067 | 0.018916 | 0.000942 | -0.019317 | -0.007946 | 0.005404 | -0.199078 | 0.017956 | 0.018318 |
| MntWines | 0.133817 | -0.006189 | 0.728021 | -0.498230 | 0.003899 | 0.016797 | 1.000000 | 0.387591 | 0.586156 | 0.398024 | ... | -0.327949 | 0.061456 | 0.373206 | 0.471790 | 0.354069 | 0.211109 | -0.036257 | 0.246534 | 0.162628 | -0.354055 |
| MntFruits | -0.070698 | -0.025400 | 0.535246 | -0.373563 | -0.177128 | -0.003376 | 0.387591 | 1.000000 | 0.562744 | 0.593941 | ... | -0.424034 | 0.014442 | 0.009850 | 0.211630 | 0.195001 | -0.010268 | -0.002988 | 0.125192 | 0.013360 | -0.395893 |
| MntMeatProducts | 0.006975 | -0.028011 | 0.693754 | -0.443248 | -0.267046 | 0.026651 | 0.586156 | 0.562744 | 1.000000 | 0.589813 | ... | -0.543223 | 0.020987 | 0.108349 | 0.385770 | 0.321606 | 0.034262 | -0.020647 | 0.247103 | 0.038417 | -0.510945 |
| MntFishProducts | -0.061636 | -0.017265 | 0.549958 | -0.388645 | -0.205301 | 0.001694 | 0.398024 | 0.593941 | 0.589813 | 1.000000 | ... | -0.452835 | -0.000243 | 0.016257 | 0.197715 | 0.260545 | 0.005094 | -0.019039 | 0.110609 | 0.042064 | -0.427105 |
| MntSweetProducts | -0.063742 | -0.019407 | 0.549835 | -0.371720 | -0.163651 | 0.023071 | 0.384885 | 0.566545 | 0.544347 | 0.578983 | ... | -0.430205 | 0.000862 | 0.027998 | 0.258439 | 0.241486 | 0.009801 | -0.020632 | 0.116545 | 0.018938 | -0.384810 |
| MntGoldProds | -0.085734 | -0.029930 | 0.412596 | -0.351059 | -0.021478 | 0.017559 | 0.384979 | 0.389145 | 0.366044 | 0.421147 | ... | -0.255997 | 0.123107 | 0.021862 | 0.175822 | 0.166619 | 0.049923 | -0.030170 | 0.139726 | 0.056934 | -0.267020 |
| NumDealsPurchases | 0.028340 | 0.024065 | -0.131051 | 0.231266 | 0.395713 | -0.002780 | 0.013908 | -0.133856 | -0.165686 | -0.141021 | ... | 0.364861 | -0.022863 | 0.016711 | -0.185887 | -0.125270 | -0.034008 | 0.004024 | 0.003424 | 0.074089 | 0.452311 |
| NumWebPurchases | 0.059840 | 0.001126 | 0.493599 | -0.365123 | 0.154624 | -0.010256 | 0.540786 | 0.295507 | 0.315014 | 0.291781 | ... | -0.065861 | 0.040796 | 0.154940 | 0.138128 | 0.154292 | 0.034383 | -0.013718 | 0.147032 | 0.153410 | -0.149525 |
| NumCatalogPurchases | 0.045244 | -0.017153 | 0.706752 | -0.517305 | -0.111976 | 0.030805 | 0.673390 | 0.515184 | 0.703668 | 0.564718 | ... | -0.528833 | 0.112332 | 0.148547 | 0.339294 | 0.325324 | 0.106978 | -0.018265 | 0.235325 | 0.140002 | -0.451658 |
| NumStorePurchases | 0.072788 | 0.007311 | 0.683315 | -0.504817 | 0.048310 | 0.001671 | 0.641490 | 0.462344 | 0.509984 | 0.459031 | ... | -0.445140 | -0.069873 | 0.178389 | 0.215589 | 0.182424 | 0.084811 | -0.012108 | 0.036697 | 0.138852 | -0.326601 |
| NumWebVisitsMonth | -0.041767 | 0.006227 | -0.644450 | 0.448826 | 0.133056 | -0.020849 | -0.327949 | -0.424034 | -0.543223 | -0.452835 | ... | 1.000000 | 0.059610 | -0.034455 | -0.279959 | -0.195869 | -0.004415 | 0.020326 | -0.007097 | -0.119885 | 0.417878 |
| AcceptedCmp3 | 0.000293 | -0.018034 | -0.011268 | 0.014500 | -0.043369 | -0.033067 | 0.061456 | 0.014442 | 0.020987 | -0.000243 | ... | 0.059610 | 1.000000 | -0.079936 | 0.080672 | 0.094422 | 0.058992 | 0.009822 | 0.253838 | -0.061412 | -0.021031 |
| AcceptedCmp4 | 0.035424 | -0.005650 | 0.231291 | -0.162170 | 0.038374 | 0.018916 | 0.373206 | 0.009850 | 0.108349 | 0.016257 | ... | -0.034455 | -0.079936 | 1.000000 | 0.307615 | 0.251025 | 0.298167 | -0.027079 | 0.176544 | 0.064121 | -0.088363 |
| AcceptedCmp5 | 0.011249 | 0.016727 | 0.414982 | -0.205456 | -0.191018 | 0.000942 | 0.471790 | 0.211630 | 0.385770 | 0.197715 | ... | -0.279959 | 0.080672 | 0.307615 | 1.000000 | 0.404459 | 0.211859 | -0.008312 | 0.327911 | -0.015654 | -0.285541 |
| AcceptedCmp1 | -0.008086 | 0.008502 | 0.342897 | -0.172910 | -0.140964 | -0.019317 | 0.354069 | 0.195001 | 0.321606 | 0.260545 | ... | -0.195869 | 0.094422 | 0.251025 | 0.404459 | 1.000000 | 0.163089 | -0.025006 | 0.293617 | 0.007949 | -0.225970 |
| AcceptedCmp2 | 0.000209 | 0.002623 | 0.104420 | -0.080084 | -0.012465 | -0.007946 | 0.211109 | -0.010268 | 0.034262 | 0.005094 | ... | -0.004415 | 0.058992 | 0.298167 | 0.211859 | 0.163089 | 1.000000 | -0.010925 | 0.162553 | 0.014806 | -0.066393 |
| Complain | -0.025773 | 0.001146 | -0.027517 | 0.036294 | 0.007442 | 0.005404 | -0.036257 | -0.002988 | -0.020647 | -0.019039 | ... | 0.020326 | 0.009822 | -0.027079 | -0.008312 | -0.025006 | -0.010925 | 1.000000 | 0.000042 | 0.004380 | 0.031388 |
| Response | 0.066127 | -0.150114 | 0.174419 | -0.080514 | -0.155726 | -0.199078 | 0.246534 | 0.125192 | 0.247103 | 0.110609 | ... | -0.007097 | 0.253838 | 0.176544 | 0.327911 | 0.293617 | 0.162553 | 0.000042 | 1.000000 | -0.018918 | -0.170480 |
| Age | 0.123008 | 0.001459 | 0.211486 | -0.234469 | 0.364148 | 0.017956 | 0.162628 | 0.013360 | 0.038417 | 0.042064 | ... | -0.119885 | -0.061412 | 0.064121 | -0.015654 | 0.007949 | 0.014806 | 0.004380 | -0.018918 | 1.000000 | 0.095842 |
| kids | 0.041548 | 0.039804 | -0.347018 | 0.689881 | 0.698287 | 0.018318 | -0.354055 | -0.395893 | -0.510945 | -0.427105 | ... | 0.417878 | -0.021031 | -0.088363 | -0.285541 | -0.225970 | -0.066393 | 0.031388 | -0.170480 | 0.095842 | 1.000000 |
26 rows × 26 columns
plt.figure(figsize=(15, 7)) # Setting the plot size
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Plotting the correlation plot
plt.show()
There are positive correlations between Income and MntWines, between NumCatalogPurchases and MntMeatProducts, and between NumCatalogPurchases and Income. There is a negative correlation between NumWebVisitsMonth and Income.
The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.
sns.barplot(x='Education', y='Income', data=data)
plt.show()
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 1 to 2239
Data columns (total 32 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Education            1680 non-null   int64
 1   Marital_Status       1680 non-null   int64
 2   Income               1680 non-null   float64
 3   Kidhome              1680 non-null   int64
 4   Teenhome             1680 non-null   int64
 5   Recency              1680 non-null   int64
 6   MntWines             1680 non-null   int64
 7   MntFruits            1680 non-null   int64
 8   MntMeatProducts      1680 non-null   int64
 9   MntFishProducts      1680 non-null   int64
 10  MntSweetProducts     1680 non-null   int64
 11  MntGoldProds         1680 non-null   int64
 12  NumDealsPurchases    1680 non-null   int64
 13  NumWebPurchases      1680 non-null   int64
 14  NumCatalogPurchases  1680 non-null   int64
 15  NumStorePurchases    1680 non-null   int64
 16  NumWebVisitsMonth    1680 non-null   int64
 17  AcceptedCmp3         1680 non-null   int64
 18  AcceptedCmp4         1680 non-null   int64
 19  AcceptedCmp5         1680 non-null   int64
 20  AcceptedCmp1         1680 non-null   int64
 21  AcceptedCmp2         1680 non-null   int64
 22  Complain             1680 non-null   int64
 23  Response             1680 non-null   int64
 24  Age                  1680 non-null   int64
 25  kids                 1680 non-null   int64
 26  ageing_customer      1680 non-null   int64
 27  Family_Size          1680 non-null   int64
 28  Expenses             1680 non-null   int64
 29  NumTotalPurchases    1680 non-null   int64
 30  TotalAcceptedCmp     1680 non-null   int64
 31  AmountPerPurchase    1680 non-null   int64
dtypes: float64(1), int64(31)
memory usage: 497.7 KB
sns.barplot(x='Education', y='Expenses', data=data)
plt.show()
Basic education (category 0) has the lowest income. The Graduation and Master categories have similar incomes overall, so we can simplify by merging them.
# Merge the Master category (2) into the Graduation category (1)
data['Education'] = np.where(data['Education'] == 2, 1, data['Education'])
data['Education'] = data['Education'].astype('int64')
# Plot the bar plot for Marital_Status and Income
sns.barplot(x='Marital_Status', y='Income', data=data)
plt.show()
# Plot the bar plot for kids and Income
sns.barplot(x='kids', y='Income', data=data)
plt.show()
# Plot the bar plot for kids and Expenses
sns.barplot(x='kids', y='Expenses', data=data)
plt.show()
We can also visualize the relationship between two categorical variables.
# Plot a grouped bar plot for Marital_Status and kids
pd.crosstab(data['Marital_Status'], data['kids']).plot(kind='bar', stacked=False)
plt.show()
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2229 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Education            2229 non-null   int64
 1   Marital_Status       2229 non-null   int64
 2   Income               2229 non-null   float64
 3   Kidhome              2229 non-null   int64
 4   Teenhome             2229 non-null   int64
 5   Recency              2229 non-null   int64
 6   MntWines             2229 non-null   int64
 7   MntFruits            2229 non-null   int64
 8   MntMeatProducts      2229 non-null   int64
 9   MntFishProducts      2229 non-null   int64
 10  MntSweetProducts     2229 non-null   int64
 11  MntGoldProds         2229 non-null   int64
 12  NumDealsPurchases    2229 non-null   int64
 13  NumWebPurchases      2229 non-null   int64
 14  NumCatalogPurchases  2229 non-null   int64
 15  NumStorePurchases    2229 non-null   int64
 16  NumWebVisitsMonth    2229 non-null   int64
 17  AcceptedCmp3         2229 non-null   int64
 18  AcceptedCmp4         2229 non-null   int64
 19  AcceptedCmp5         2229 non-null   int64
 20  AcceptedCmp1         2229 non-null   int64
 21  AcceptedCmp2         2229 non-null   int64
 22  Complain             2229 non-null   int64
 23  Response             2229 non-null   int64
 24  Age                  2229 non-null   int64
 25  kids                 2229 non-null   int64
 26  ageing_customer      2229 non-null   int64
dtypes: float64(1), int64(26)
memory usage: 552.1 KB
# Add Marital_Status and kids (plus 1 for the customer) to get the total number of persons in each family
data["Family_Size"] = data['Marital_Status'] + data['kids'] + 1
data.head()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 65 | 0 | 3684 | 1 |
| 1 | 1 | 0 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 68 | 2 | 3134 | 3 |
| 2 | 1 | 1 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 57 | 0 | 3333 | 2 |
| 3 | 1 | 1 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 38 | 1 | 3160 | 3 |
| 4 | 2 | 1 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 1 | 3182 | 3 |
5 rows × 28 columns
# Create a new feature
# Add the amounts spent on each product: 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
data["Expenses"] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']
data.tail()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 1 | 1 | 61223.0 | 0 | 1 | 46 | 709 | 43 | 182 | 42 | ... | 0 | 0 | 0 | 0 | 0 | 55 | 1 | 3402 | 3 | 1341 |
| 2236 | 2 | 1 | 64014.0 | 2 | 1 | 56 | 406 | 0 | 30 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 76 | 3 | 3040 | 5 | 444 |
| 2237 | 1 | 0 | 56981.0 | 0 | 0 | 91 | 908 | 48 | 217 | 32 | ... | 0 | 0 | 0 | 0 | 0 | 41 | 0 | 3176 | 1 | 1241 |
| 2238 | 2 | 1 | 69245.0 | 0 | 1 | 8 | 428 | 30 | 214 | 80 | ... | 0 | 0 | 0 | 0 | 0 | 66 | 1 | 3177 | 3 | 843 |
| 2239 | 2 | 1 | 52869.0 | 1 | 1 | 40 | 84 | 3 | 61 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 68 | 2 | 3643 | 4 | 172 |
5 rows × 29 columns
# Create a new feature
# Add the number of purchases from each channel 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
data["NumTotalPurchases"] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']
data.describe()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | ... | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 | 2229.000000 |
| mean | 1.447286 | 0.644235 | 51628.825945 | 0.444594 | 0.507402 | 49.106326 | 304.991476 | 26.348587 | 165.283984 | 37.642441 | ... | 0.064603 | 0.013010 | 0.008973 | 0.149843 | 53.111261 | 0.951996 | 3374.777479 | 2.596231 | 605.541050 | 14.869000 |
| std | 0.543881 | 0.478852 | 20601.302353 | 0.538636 | 0.544735 | 28.946476 | 336.761943 | 39.764060 | 219.336589 | 54.700548 | ... | 0.245879 | 0.113344 | 0.094319 | 0.356998 | 11.707430 | 0.751976 | 202.248700 | 0.907432 | 601.032228 | 7.622187 |
| min | 0.000000 | 0.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 26.000000 | 0.000000 | 3021.000000 | 1.000000 | 5.000000 | 0.000000 |
| 25% | 1.000000 | 0.000000 | 35416.000000 | 0.000000 | 0.000000 | 24.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 45.000000 | 0.000000 | 3202.000000 | 2.000000 | 69.000000 | 8.000000 |
| 50% | 1.000000 | 1.000000 | 51563.000000 | 0.000000 | 0.000000 | 49.000000 | 176.000000 | 8.000000 | 67.000000 | 12.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 52.000000 | 1.000000 | 3377.000000 | 3.000000 | 397.000000 | 15.000000 |
| 75% | 2.000000 | 1.000000 | 68118.000000 | 1.000000 | 1.000000 | 74.000000 | 505.000000 | 33.000000 | 231.000000 | 50.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 63.000000 | 1.000000 | 3550.000000 | 3.000000 | 1044.000000 | 21.000000 |
| max | 2.000000 | 1.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 82.000000 | 3.000000 | 3720.000000 | 5.000000 | 2525.000000 | 43.000000 |
8 rows × 30 columns
# Add all the campaign related variables to get the total number of accepted campaigns by a customer
# "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]
data.head(100)
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 0 | 0 | 1 | 65 | 0 | 3684 | 1 | 1617 | 25 | 1 |
| 1 | 1 | 0 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 0 | 0 | 68 | 2 | 3134 | 3 | 27 | 6 | 0 |
| 2 | 1 | 1 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 0 | 0 | 57 | 0 | 3333 | 2 | 776 | 21 | 0 |
| 3 | 1 | 1 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 0 | 0 | 38 | 1 | 3160 | 3 | 53 | 8 | 0 |
| 4 | 2 | 1 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 0 | 0 | 41 | 1 | 3182 | 3 | 422 | 19 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1 | 1 | 30096.0 | 1 | 0 | 30 | 5 | 3 | 11 | 12 | ... | 0 | 0 | 0 | 39 | 1 | 3059 | 3 | 45 | 6 | 0 |
| 96 | 1 | 0 | 47916.0 | 0 | 1 | 72 | 505 | 0 | 26 | 0 | ... | 0 | 0 | 0 | 62 | 1 | 3605 | 2 | 606 | 22 | 1 |
| 97 | 1 | 0 | 51813.0 | 1 | 1 | 37 | 51 | 2 | 7 | 0 | ... | 0 | 0 | 0 | 50 | 2 | 3465 | 3 | 63 | 7 | 0 |
| 98 | 2 | 1 | 78497.0 | 0 | 0 | 44 | 207 | 26 | 447 | 75 | ... | 0 | 0 | 0 | 71 | 0 | 3231 | 2 | 978 | 25 | 1 |
| 99 | 2 | 1 | 50150.0 | 0 | 0 | 32 | 135 | 46 | 92 | 65 | ... | 0 | 0 | 0 | 39 | 0 | 3395 | 2 | 410 | 16 | 0 |
100 rows × 31 columns
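The three feature sums above (Expenses, NumTotalPurchases, TotalAcceptedCmp) can also be written more concisely by summing over a list of column names. A minimal sketch on a hypothetical two-row frame (values chosen to match the first two rows shown above):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `data`; column names match the notebook
df = pd.DataFrame({
    "MntWines": [635, 11], "MntFruits": [88, 1], "MntMeatProducts": [546, 6],
    "MntFishProducts": [172, 2], "MntSweetProducts": [88, 1], "MntGoldProds": [88, 6],
})

spend_cols = ["MntWines", "MntFruits", "MntMeatProducts",
              "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

# Summing over a column list avoids repeating every column name in the expression
df["Expenses"] = df[spend_cols].sum(axis=1)
print(df["Expenses"].tolist())  # [1617, 27]
```

The same pattern applies to the purchase-channel and campaign columns; it also makes the code robust to adding or removing a column from the list.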
# Divide the "Expebnses" by "NumTotalPurchases" to create the new feature AmountPerPurchase
# Let's get the observations equal 0
index = data.index
ids = index[data.NumTotalPurchases == 0]
print(ids)
Int64Index([981, 1524], dtype='int64')
data['Expenses'].describe()
count    2229.000000
mean      605.541050
std       601.032228
min         5.000000
25%        69.000000
50%       397.000000
75%      1044.000000
max      2525.000000
Name: Expenses, dtype: float64
data.drop(ids, inplace=True)
data['AmountPerPurchase'] = data['Expenses'] / data["NumTotalPurchases"]
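As an alternative to dropping the two zero-purchase rows before dividing, one could keep them and let the ratio come out as missing. A minimal sketch on hypothetical data (not the notebook's frame):

```python
import numpy as np
import pandas as pd

# Hypothetical example: two customers, one with zero recorded purchases
df = pd.DataFrame({"Expenses": [100, 50], "NumTotalPurchases": [4, 0]})

# Replacing 0 with NaN before dividing keeps the row but yields a missing
# ratio instead of an infinite one
df["AmountPerPurchase"] = df["Expenses"] / df["NumTotalPurchases"].replace(0, np.nan)
print(df["AmountPerPurchase"].tolist())  # [25.0, nan]
```

Dropping the rows (as done here) is fine when there are only two of them; the NaN route is preferable when zero-purchase customers are numerous enough to matter.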
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Education | 2227.0 | 1.447687 | 0.543960 | 0.000000 | 1.000000 | 1.000000 | 2.000000 | 2.0 |
| Marital_Status | 2227.0 | 0.644814 | 0.478677 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 |
| Income | 2227.0 | 51671.436925 | 20561.373156 | 1730.000000 | 35482.000000 | 51569.000000 | 68118.000000 | 113734.0 |
| Kidhome | 2227.0 | 0.444544 | 0.538667 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.0 |
| Teenhome | 2227.0 | 0.507858 | 0.544767 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.0 |
| Recency | 2227.0 | 49.116300 | 28.952535 | 0.000000 | 24.000000 | 49.000000 | 74.000000 | 99.0 |
| MntWines | 2227.0 | 305.263583 | 336.790652 | 0.000000 | 24.000000 | 177.000000 | 505.000000 | 1493.0 |
| MntFruits | 2227.0 | 26.371352 | 39.774657 | 0.000000 | 2.000000 | 8.000000 | 33.000000 | 199.0 |
| MntMeatProducts | 2227.0 | 165.431522 | 219.379791 | 0.000000 | 16.000000 | 68.000000 | 231.500000 | 1725.0 |
| MntFishProducts | 2227.0 | 37.675797 | 54.713778 | 0.000000 | 3.000000 | 12.000000 | 50.000000 | 259.0 |
| MntSweetProducts | 2227.0 | 27.185900 | 41.349490 | 0.000000 | 1.000000 | 8.000000 | 34.000000 | 263.0 |
| MntGoldProds | 2227.0 | 44.151774 | 52.098612 | 0.000000 | 9.000000 | 25.000000 | 56.500000 | 362.0 |
| NumDealsPurchases | 2227.0 | 2.321958 | 1.894826 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 15.0 |
| NumWebPurchases | 2227.0 | 4.103727 | 2.773984 | 0.000000 | 2.000000 | 4.000000 | 6.000000 | 27.0 |
| NumCatalogPurchases | 2227.0 | 2.638976 | 2.796299 | 0.000000 | 0.000000 | 2.000000 | 4.000000 | 28.0 |
| NumStorePurchases | 2227.0 | 5.817692 | 3.238913 | 0.000000 | 3.000000 | 5.000000 | 8.000000 | 13.0 |
| NumWebVisitsMonth | 2227.0 | 5.326448 | 2.402050 | 0.000000 | 3.000000 | 6.000000 | 7.000000 | 20.0 |
| AcceptedCmp3 | 2227.0 | 0.073193 | 0.260511 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| AcceptedCmp4 | 2227.0 | 0.074989 | 0.263432 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| AcceptedCmp5 | 2227.0 | 0.072744 | 0.259774 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| AcceptedCmp1 | 2227.0 | 0.064661 | 0.245982 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| AcceptedCmp2 | 2227.0 | 0.013022 | 0.113394 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| Complain | 2227.0 | 0.008981 | 0.094361 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| Response | 2227.0 | 0.149978 | 0.357130 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| Age | 2227.0 | 53.111361 | 11.712074 | 26.000000 | 45.000000 | 52.000000 | 63.000000 | 82.0 |
| kids | 2227.0 | 0.952402 | 0.752043 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 3.0 |
| ageing_customer | 2227.0 | 3374.893579 | 202.197318 | 3021.000000 | 3202.000000 | 3377.000000 | 3550.000000 | 3720.0 |
| Family_Size | 2227.0 | 2.597216 | 0.907120 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 5.0 |
| Expenses | 2227.0 | 606.079928 | 601.032876 | 8.000000 | 69.000000 | 397.000000 | 1044.500000 | 2525.0 |
| NumTotalPurchases | 2227.0 | 14.882353 | 7.612563 | 1.000000 | 8.000000 | 15.000000 | 21.000000 | 43.0 |
| TotalAcceptedCmp | 2227.0 | 0.448586 | 0.890416 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 5.0 |
| AmountPerPurchase | 2227.0 | 33.274270 | 45.040897 | 0.533333 | 9.714286 | 23.352941 | 45.281773 | 1679.0 |
Now, let's check the maximum value of the AmountPerPurchase.
Think About It: The maximum AmountPerPurchase (1679) is far above the 75th percentile (~45). Should such extreme observations be treated as outliers and removed before clustering?
# Find the observations where AmountPerPurchase is greater than 46 (the extreme right tail)
index = data.index
ids = index[data.AmountPerPurchase > 46]
print(ids)
data.drop(ids, inplace=True)
Int64Index([ 0, 12, 15, 18, 22, 29, 34, 40, 45, 49,
...
2193, 2194, 2206, 2211, 2213, 2217, 2221, 2228, 2235, 2237],
dtype='int64', length=547)
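The cutoff of 46 is hardcoded (roughly the 75th percentile seen above). A hedged alternative is to derive the cutoff from the data with the classic IQR rule; a sketch on a small synthetic series (not the notebook's column):

```python
import pandas as pd

# Synthetic stand-in for the AmountPerPurchase values
s = pd.Series([5, 9, 16, 23, 27, 45, 300, 1679])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # Tukey's upper fence

outliers = s[s > upper]
print(upper, outliers.tolist())
```

A data-driven fence like this adapts if the data are refreshed, whereas a literal 46 silently becomes stale. Note, though, that dropping 547 of 2227 rows is a large cut; whichever rule is used, it is worth profiling the dropped customers before discarding them.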
data.describe()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | ... | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 |
| mean | 1.439286 | 0.655357 | 44111.917724 | 0.573214 | 0.587500 | 48.441071 | 170.642857 | 15.061905 | 71.968452 | 20.723810 | ... | 0.010714 | 0.102381 | 52.744643 | 1.160714 | 3364.927381 | 2.816071 | 328.027976 | 13.267262 | 0.249405 | 18.617871 |
| std | 0.557480 | 0.475393 | 17049.412847 | 0.549511 | 0.540849 | 29.039459 | 211.093201 | 27.958251 | 98.910845 | 36.262361 | ... | 0.102984 | 0.303239 | 11.277100 | 0.687074 | 199.717819 | 0.845464 | 345.212527 | 7.767198 | 0.613769 | 11.924125 |
| min | 0.000000 | 0.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 26.000000 | 0.000000 | 3021.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.533333 |
| 25% | 1.000000 | 0.000000 | 31486.250000 | 0.000000 | 0.000000 | 24.000000 | 15.000000 | 1.000000 | 11.000000 | 2.000000 | ... | 0.000000 | 0.000000 | 44.750000 | 1.000000 | 3190.000000 | 2.000000 | 54.000000 | 7.000000 | 0.000000 | 8.000000 |
| 50% | 1.000000 | 1.000000 | 43141.000000 | 1.000000 | 1.000000 | 48.000000 | 70.000000 | 4.000000 | 30.000000 | 7.000000 | ... | 0.000000 | 0.000000 | 51.000000 | 1.000000 | 3360.000000 | 3.000000 | 166.000000 | 11.000000 | 0.000000 | 16.118056 |
| 75% | 2.000000 | 1.000000 | 56349.250000 | 1.000000 | 1.000000 | 74.000000 | 267.000000 | 15.000000 | 99.000000 | 20.000000 | ... | 0.000000 | 0.000000 | 62.000000 | 2.000000 | 3533.000000 | 3.000000 | 536.500000 | 19.000000 | 0.000000 | 27.616228 |
| max | 2.000000 | 1.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1200.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 1.000000 | 82.000000 | 3.000000 | 3720.000000 | 5.000000 | 1730.000000 | 43.000000 | 5.000000 | 45.960000 |
8 rows × 32 columns
Now, let's check the distribution of values in the AmountPerPurchase column.
# Convert AmountPerPurchase to integer and check its summary statistics
data['AmountPerPurchase'] = data['AmountPerPurchase'].astype('int64')
data.AmountPerPurchase.describe()
count    1680.000000
mean       18.167857
std        11.919990
min         0.000000
25%         8.000000
50%        16.000000
75%        27.000000
max        45.000000
Name: AmountPerPurchase, dtype: float64
# Plot the histogram for the AmountPerPurchase variable
plt.figure(figsize=(20, 7))
sns.histplot(x='AmountPerPurchase', data=data)
plt.show()
# Confirm that there are no missing values left in the data
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 1 to 2239
Data columns (total 32 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Education            1680 non-null   int64
 1   Marital_Status       1680 non-null   int64
 2   Income               1680 non-null   float64
 3   Kidhome              1680 non-null   int64
 4   Teenhome             1680 non-null   int64
 5   Recency              1680 non-null   int64
 6   MntWines             1680 non-null   int64
 7   MntFruits            1680 non-null   int64
 8   MntMeatProducts      1680 non-null   int64
 9   MntFishProducts      1680 non-null   int64
 10  MntSweetProducts     1680 non-null   int64
 11  MntGoldProds         1680 non-null   int64
 12  NumDealsPurchases    1680 non-null   int64
 13  NumWebPurchases      1680 non-null   int64
 14  NumCatalogPurchases  1680 non-null   int64
 15  NumStorePurchases    1680 non-null   int64
 16  NumWebVisitsMonth    1680 non-null   int64
 17  AcceptedCmp3         1680 non-null   int64
 18  AcceptedCmp4         1680 non-null   int64
 19  AcceptedCmp5         1680 non-null   int64
 20  AcceptedCmp1         1680 non-null   int64
 21  AcceptedCmp2         1680 non-null   int64
 22  Complain             1680 non-null   int64
 23  Response             1680 non-null   int64
 24  Age                  1680 non-null   int64
 25  kids                 1680 non-null   int64
 26  ageing_customer      1680 non-null   int64
 27  Family_Size          1680 non-null   int64
 28  Expenses             1680 non-null   int64
 29  NumTotalPurchases    1680 non-null   int64
 30  TotalAcceptedCmp     1680 non-null   int64
 31  AmountPerPurchase    1680 non-null   int64
dtypes: float64(1), int64(31)
memory usage: 497.7 KB
Now that we are done with data preprocessing, let's visualize the new features against the Income variable obtained after imputing missing values.
data.describe()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | ... | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 |
| mean | 1.439286 | 0.655357 | 44111.917724 | 0.573214 | 0.587500 | 48.441071 | 170.642857 | 15.061905 | 71.968452 | 20.723810 | ... | 0.010714 | 0.102381 | 52.744643 | 1.160714 | 3364.927381 | 2.816071 | 328.027976 | 13.267262 | 0.249405 | 18.167857 |
| std | 0.557480 | 0.475393 | 17049.412847 | 0.549511 | 0.540849 | 29.039459 | 211.093201 | 27.958251 | 98.910845 | 36.262361 | ... | 0.102984 | 0.303239 | 11.277100 | 0.687074 | 199.717819 | 0.845464 | 345.212527 | 7.767198 | 0.613769 | 11.919990 |
| min | 0.000000 | 0.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 26.000000 | 0.000000 | 3021.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.000000 | 31486.250000 | 0.000000 | 0.000000 | 24.000000 | 15.000000 | 1.000000 | 11.000000 | 2.000000 | ... | 0.000000 | 0.000000 | 44.750000 | 1.000000 | 3190.000000 | 2.000000 | 54.000000 | 7.000000 | 0.000000 | 8.000000 |
| 50% | 1.000000 | 1.000000 | 43141.000000 | 1.000000 | 1.000000 | 48.000000 | 70.000000 | 4.000000 | 30.000000 | 7.000000 | ... | 0.000000 | 0.000000 | 51.000000 | 1.000000 | 3360.000000 | 3.000000 | 166.000000 | 11.000000 | 0.000000 | 16.000000 |
| 75% | 2.000000 | 1.000000 | 56349.250000 | 1.000000 | 1.000000 | 74.000000 | 267.000000 | 15.000000 | 99.000000 | 20.000000 | ... | 0.000000 | 0.000000 | 62.000000 | 2.000000 | 3533.000000 | 3.000000 | 536.500000 | 19.000000 | 0.000000 | 27.000000 |
| max | 2.000000 | 1.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1200.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 1.000000 | 82.000000 | 3.000000 | 3720.000000 | 5.000000 | 1730.000000 | 43.000000 | 5.000000 | 45.000000 |
8 rows × 32 columns
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis
plt.figure(figsize=(20, 10)) # Setting the plot size
sns.scatterplot(x = 'Income', y = 'Expenses', data = data) # Hint: Use sns.scatterplot()
plt.xticks(fontsize=16) # Font size of X-label
plt.yticks(fontsize=16) # Font size of Y-label
plt.xlabel("Income", fontsize=20, labelpad=20) # Title of X-axis
plt.ylabel("Expenses", fontsize=20, labelpad=20) # Title of Y-axis
plt.show()
# Plot the bar plot for Family Size on X-axis and Income on Y-axis
sns.barplot(x='Family_Size', y='Income', data=data)
plt.show()
# Plot the bar plot for Family Size on X-axis and Expenses on Y-axis
sns.barplot(x='Family_Size', y='Expenses', data=data)
plt.show()
data.describe()
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | ... | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 | 1680.000000 |
| mean | 1.439286 | 0.655357 | 44111.917724 | 0.573214 | 0.587500 | 48.441071 | 170.642857 | 15.061905 | 71.968452 | 20.723810 | ... | 0.010714 | 0.102381 | 52.744643 | 1.160714 | 3364.927381 | 2.816071 | 328.027976 | 13.267262 | 0.249405 | 18.167857 |
| std | 0.557480 | 0.475393 | 17049.412847 | 0.549511 | 0.540849 | 29.039459 | 211.093201 | 27.958251 | 98.910845 | 36.262361 | ... | 0.102984 | 0.303239 | 11.277100 | 0.687074 | 199.717819 | 0.845464 | 345.212527 | 7.767198 | 0.613769 | 11.919990 |
| min | 0.000000 | 0.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 26.000000 | 0.000000 | 3021.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.000000 | 31486.250000 | 0.000000 | 0.000000 | 24.000000 | 15.000000 | 1.000000 | 11.000000 | 2.000000 | ... | 0.000000 | 0.000000 | 44.750000 | 1.000000 | 3190.000000 | 2.000000 | 54.000000 | 7.000000 | 0.000000 | 8.000000 |
| 50% | 1.000000 | 1.000000 | 43141.000000 | 1.000000 | 1.000000 | 48.000000 | 70.000000 | 4.000000 | 30.000000 | 7.000000 | ... | 0.000000 | 0.000000 | 51.000000 | 1.000000 | 3360.000000 | 3.000000 | 166.000000 | 11.000000 | 0.000000 | 16.000000 |
| 75% | 2.000000 | 1.000000 | 56349.250000 | 1.000000 | 1.000000 | 74.000000 | 267.000000 | 15.000000 | 99.000000 | 20.000000 | ... | 0.000000 | 0.000000 | 62.000000 | 2.000000 | 3533.000000 | 3.000000 | 536.500000 | 19.000000 | 0.000000 | 27.000000 |
| max | 2.000000 | 1.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1200.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 1.000000 | 82.000000 | 3.000000 | 3720.000000 | 5.000000 | 1730.000000 | 43.000000 | 5.000000 | 45.000000 |
8 rows × 32 columns
data.to_csv('F:/DataScienceAI/capstone project/data_transformed.csv')
data_model = data.copy()
# Dropping all the irrelevant columns and storing in data_model
data_model = data_model.drop(
columns=[
"Complain",
"Response",
"AcceptedCmp1",
"AcceptedCmp2",
"AcceptedCmp3",
"AcceptedCmp4",
"AcceptedCmp5",
"Marital_Status",
"MntWines",
"MntFruits",
"MntMeatProducts",
"MntFishProducts",
"MntSweetProducts",
"MntGoldProds",
'Education',
'Age',
"kids",
"NumDealsPurchases",
"NumWebPurchases",
"NumCatalogPurchases",
"NumStorePurchases",
"NumWebVisitsMonth",
'Family_Size',
'Kidhome',
'Teenhome'
],
axis=1,
)
# Check the shape of new data
data_shape = data_model.shape
print(data_shape)
(1680, 7)
data_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 1 to 2239
Data columns (total 7 columns):
 #  Column             Non-Null Count  Dtype
--  ------             --------------  -----
 0  Income             1680 non-null   float64
 1  Recency            1680 non-null   int64
 2  ageing_customer    1680 non-null   int64
 3  Expenses           1680 non-null   int64
 4  NumTotalPurchases  1680 non-null   int64
 5  TotalAcceptedCmp   1680 non-null   int64
 6  AmountPerPurchase  1680 non-null   int64
dtypes: float64(1), int64(6)
memory usage: 169.5 KB
data_model.head()
| Income | Recency | ageing_customer | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|
| 1 | 46344.0 | 38 | 3134 | 27 | 6 | 0 | 4 |
| 2 | 71613.0 | 26 | 3333 | 776 | 21 | 0 | 36 |
| 3 | 26646.0 | 26 | 3160 | 53 | 8 | 0 | 6 |
| 4 | 58293.0 | 94 | 3182 | 422 | 19 | 0 | 22 |
| 5 | 62513.0 | 16 | 3314 | 716 | 22 | 0 | 32 |
corr_matrix = data_model.corr()
corr_matrix
plt.figure(figsize=(15, 7)) # Setting the plot size
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Plotting the correlation plot
plt.show()
Observations
There are strong positive correlations between NumTotalPurchases and Expenses, between Expenses and AmountPerPurchase, and between NumTotalPurchases and AmountPerPurchase. There are no strong negative correlations.
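Rather than reading the strongest pairs off the heatmap by eye, they can be extracted programmatically from the correlation matrix. A sketch on a toy frame (hypothetical columns `a`, `b`, `c`, not the notebook's data):

```python
import pandas as pd

# Toy frame standing in for data_model
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 3, 4, 1, 2],    # negatively related to a and b
})

corr = df.corr()
# Flatten the matrix, keep each unordered pair once, and rank by absolute strength
pairs = corr.unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
top = pairs.abs().sort_values(ascending=False)
print(top.head())
```

Applied to `data_model`, this lists the strongest relationships (including negative ones) without any manual scanning.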
# Applying standard scaler on new data
# Initialize the Standard Scaler
scaler = StandardScaler()
# fit_transform the scaler function on new data
df_scaled = scaler.fit_transform(data_model)
# Converting the scaled array back to a dataframe
df_scaled = pd.DataFrame(df_scaled, columns=data_model.columns)
df_scaled.head()
| Income | Recency | ageing_customer | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|
| 0 | 0.130957 | -0.359655 | -1.156613 | -0.872267 | -0.935914 | -0.406471 | -1.188934 |
| 1 | 1.613503 | -0.773009 | -0.159910 | 1.298057 | 0.995860 | -0.406471 | 1.496432 |
| 2 | -1.024734 | -0.773009 | -1.026390 | -0.796929 | -0.678344 | -0.406471 | -1.021098 |
| 3 | 0.832011 | 1.569330 | -0.916202 | 0.272296 | 0.738290 | -0.406471 | 0.321584 |
| 4 | 1.079601 | -1.117470 | -0.255073 | 1.124199 | 1.124645 | -0.406471 | 1.160761 |
# Fitting t-SNE with number of components equal to 2 to visualize how the data is distributed
from sklearn.manifold import TSNE # t-SNE is not imported at the top, so import it here
# Initializing t-SNE with n_components=2, random_state=1, and perplexity=35
tsne = TSNE(n_components=2, random_state=1, perplexity=35)
data_tsne = tsne.fit_transform(df_scaled) # fit_transform t-SNE on the scaled data
data_tsne = pd.DataFrame(data_tsne, columns=[0, 1]) # Converting the embeddings to a dataframe
plt.figure(figsize=(7, 7)) # Scatter plot for the two components
sns.scatterplot(x=0, y=1, data=data_tsne) # Plotting the t-SNE embedding
plt.show()
# Defining the number of principal components to generate
from sklearn.decomposition import PCA # PCA is not imported at the top, so import it here
n = data_model.shape[1] # Storing the number of variables in the data
pca = PCA(n_components=n, random_state=1) # Initialize PCA with n_components=n and random_state=1
data_pca = pd.DataFrame(pca.fit_transform(df_scaled)) # fit_transform PCA on the scaled data
# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_
sns.scatterplot(x=0, y=1, data=data_pca) # Plot the first two principal components
plt.show()
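The explained-variance ratios stored in `exp_var` above can be used to decide how many components are worth keeping. A sketch on synthetic data (the 90% threshold is an assumed choice, not one the notebook makes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))          # synthetic stand-in for df_scaled
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]      # make two columns nearly redundant

pca = PCA(n_components=7, random_state=1).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 90% of the variance
n_keep = int(np.argmax(cum_var >= 0.90)) + 1
print(n_keep, cum_var.round(2))
```

If only a handful of components reach the threshold, clustering on that reduced `data_pca` slice keeps most of the signal while dropping noisy directions.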
distortions = [] # Create an empty list
K = range(2, 10) # Candidate values of k from 2 to 9
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=1) # Initialize K-Means
    kmeanModel.fit(data_pca) # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_) # Append the distortion (inertia) value to the list
# Plotting the elbow plot
plt.figure(figsize=(16, 8)) # Setting the plot size
plt.plot(K, distortions, "bx-") # Plotting the K on X-axis and distortions on y-axis
plt.xlabel("k") # Title of x-axis
plt.ylabel("Distortion") # Title of y-axis
plt.title("The Elbow Method showing the optimal k") # Title of the plot
plt.show()
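Reading the elbow off the plot is subjective; one simple heuristic is to pick the k where the distortion curve bends most sharply, i.e. the largest second difference. A sketch on synthetic blob data with three true clusters (the heuristic and the `make_blobs` setup are illustrative assumptions, not the notebook's method):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

K = range(2, 10)
distortions = [KMeans(n_clusters=k, random_state=1, n_init=10).fit(X).inertia_
               for k in K]

# The elbow is taken as the k with the largest second difference,
# i.e. where the curve bends most sharply
second_diff = np.diff(distortions, 2)
elbow_k = K[int(np.argmax(second_diff)) + 1]
print(elbow_k)  # 3
```

Such heuristics are a sanity check on the visual reading, not a replacement for it; silhouette scores (computed below) give an independent second opinion.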
data_pca.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | -1.619710 | -0.366646 | 1.251773 | 0.007878 | -0.488247 | 0.317618 | 0.188856 |
| 1 | 2.595637 | -0.270793 | 0.799426 | -0.810939 | -0.368801 | -0.251374 | 0.033878 |
| 2 | -1.855042 | 0.102592 | 1.036322 | -0.366075 | 0.564947 | 0.157294 | 0.024897 |
| 3 | 0.873202 | -1.765879 | 0.094906 | 0.897002 | -0.049336 | 0.369427 | -0.221308 |
| 4 | 2.162642 | 0.017551 | 0.917874 | -0.997418 | 0.064268 | -0.007288 | -0.041550 |
from sklearn.metrics import silhouette_score # silhouette_score is not imported at the top
sil_score = {} # Creating an empty dictionary
cluster_list = range(3, 7) # Candidate cluster counts from 3 to 6
for k in cluster_list:
    # Initialize K-Means with n_clusters=k and random_state=1, and fit it on the PCA data
    clusterer = KMeans(n_clusters=k, random_state=1).fit(data_pca)
    # Predict the cluster labels on the PCA data
    preds = clusterer.predict(data_pca)
    # Calculate the silhouette score
    score = silhouette_score(data_pca, preds)
    # Store the silhouette score in the dictionary
    sil_score[k] = score
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(k, score))
For n_clusters = 3, the silhouette score is 0.334493184663598
For n_clusters = 4, the silhouette score is 0.3432079436513443
For n_clusters = 5, the silhouette score is 0.2835368980665695
For n_clusters = 6, the silhouette score is 0.23965552654173458
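Since the scores are collected in a dictionary, the best k can be read off programmatically. A minimal sketch using hypothetical values mirroring the output above:

```python
# Hypothetical dict mirroring the notebook's sil_score output (rounded)
sil_score = {3: 0.3345, 4: 0.3432, 5: 0.2835, 6: 0.2397}

# The k with the highest silhouette score
best_k = max(sil_score, key=sil_score.get)
print(best_k)  # 4
```

Here k=4 edges out k=3 only slightly, so proceeding with 3 clusters (as the next cell does) is defensible given the elbow at 3; when scores are this close, cluster interpretability is a reasonable tie-breaker.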
# Initialize the K-Means algorithm with 3 clusters and random_state=1
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(data_pca)
KMeans(n_clusters=3, random_state=1)
data_pca["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data ["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to data_model
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
0 631 1 562 2 487 Name: K_means_segments_3, dtype: int64
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_Kmeans_3 = data.groupby("K_means_segments_3").mean()
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_Kmeans_3.style.highlight_max(color="lightgreen", axis=0)
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_3 | ||||||||||||||||||||||||||||||||
| 0 | 1.483360 | 0.646593 | 36835.120461 | 0.787639 | 0.513471 | 50.507132 | 40.749604 | 4.004754 | 19.671949 | 5.679873 | 4.163233 | 12.090333 | 1.923930 | 2.007924 | 0.481775 | 3.270998 | 5.895404 | 0.055468 | 0.011094 | 0.000000 | 0.000000 | 0.003170 | 0.007924 | 0.033281 | 51.448494 | 1.301109 | 3205.152139 | 2.947702 | 86.359746 | 7.684628 | 0.103011 | 9.394612 |
| 1 | 1.503559 | 0.670819 | 61458.895922 | 0.213523 | 0.781139 | 47.450178 | 404.943060 | 34.010676 | 161.761566 | 45.537367 | 35.416370 | 66.471530 | 3.279359 | 6.617438 | 3.907473 | 8.688612 | 5.012456 | 0.072954 | 0.103203 | 0.030249 | 0.048043 | 0.014235 | 0.010676 | 0.115658 | 55.891459 | 0.994662 | 3384.370107 | 2.665480 | 748.140569 | 22.492883 | 0.384342 | 32.247331 |
| 2 | 1.308008 | 0.648871 | 33521.891698 | 0.710472 | 0.459959 | 46.907598 | 68.560575 | 7.521561 | 36.106776 | 11.581109 | 7.406571 | 25.166324 | 2.616016 | 2.878850 | 0.804928 | 3.554415 | 7.143737 | 0.086242 | 0.018480 | 0.000000 | 0.002053 | 0.000000 | 0.014374 | 0.176591 | 50.792608 | 1.170431 | 3549.509240 | 2.819302 | 156.342916 | 9.854209 | 0.283368 | 13.287474 |
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','ageing_customer','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (50, 50))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['K_means_segments_3'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
data_pca.drop(['K_means_segments_3'], axis =1, inplace = True)
data.drop(['K_means_segments_3'], axis =1, inplace = True)
data_model.drop(['K_means_segments_3'], axis =1, inplace = True)
# Fit the K-Means algorithm using number of clusters as 5 and random_state=0 on data_pca
# Initialize the K-Means algorithm with 5 clusters and random_state=0
kmeans = KMeans(n_clusters=5, random_state=0)
# Fitting on the data_pca
kmeans.fit(data_pca)
KMeans(n_clusters=5, random_state=0)
data_pca["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data ["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_5"] = kmeans.labels_
# Let's check the distribution
data_model["K_means_segments_5"].value_counts()
1 479 2 410 3 399 0 336 4 56 Name: K_means_segments_5, dtype: int64
# Hint: Use PCA_PLOT function created above
#def PCA_PLOT(X, Y, PCA, cluster):
# sns.scatterplot(x=X, y=1, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_Kmeans_5 = data.groupby("K_means_segments_5").mean()
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_Kmeans_5.style.highlight_max(color="lightgreen", axis=0)
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_5 | ||||||||||||||||||||||||||||||||
| 0 | 1.464286 | 0.669643 | 64583.300611 | 0.178571 | 0.761905 | 47.589286 | 457.211310 | 43.863095 | 200.449405 | 57.047619 | 45.455357 | 77.324405 | 3.151786 | 6.955357 | 4.562500 | 9.827381 | 4.613095 | 0.041667 | 0.050595 | 0.011905 | 0.020833 | 0.000000 | 0.011905 | 0.041667 | 56.157738 | 0.940476 | 3390.607143 | 2.610119 | 881.351190 | 24.497024 | 0.166667 | 35.526786 |
| 1 | 1.471816 | 0.647182 | 34846.332478 | 0.841336 | 0.482255 | 45.492693 | 25.628392 | 3.196242 | 14.486430 | 4.559499 | 3.225470 | 10.252610 | 1.837161 | 1.784969 | 0.350731 | 2.981211 | 5.981211 | 0.070981 | 0.004175 | 0.000000 | 0.000000 | 0.004175 | 0.010438 | 0.050104 | 50.594990 | 1.323591 | 3178.455115 | 2.970772 | 61.348643 | 6.954071 | 0.129436 | 7.876827 |
| 2 | 1.304878 | 0.634146 | 29730.925017 | 0.812195 | 0.382927 | 52.619512 | 28.019512 | 5.231707 | 20.856098 | 7.270732 | 5.446341 | 15.548780 | 2.100000 | 1.960976 | 0.451220 | 2.992683 | 7.241463 | 0.058537 | 0.009756 | 0.000000 | 0.000000 | 0.000000 | 0.014634 | 0.134146 | 49.126829 | 1.195122 | 3532.712195 | 2.829268 | 82.373171 | 7.504878 | 0.202439 | 9.697561 |
| 3 | 1.491228 | 0.696742 | 50530.714929 | 0.378446 | 0.781955 | 49.724311 | 211.418546 | 14.598997 | 75.842105 | 21.912281 | 14.195489 | 41.355890 | 3.355890 | 5.000000 | 1.839599 | 5.839599 | 5.704261 | 0.045113 | 0.062657 | 0.000000 | 0.010025 | 0.000000 | 0.005013 | 0.070175 | 55.952381 | 1.160401 | 3382.553885 | 2.857143 | 379.323308 | 16.035088 | 0.187970 | 22.902256 |
| 4 | 1.625000 | 0.500000 | 60092.946429 | 0.285714 | 0.553571 | 39.035714 | 445.303571 | 19.017857 | 139.375000 | 31.071429 | 25.410714 | 62.035714 | 3.428571 | 7.089286 | 4.857143 | 6.982143 | 6.357143 | 0.500000 | 0.464286 | 0.232143 | 0.303571 | 0.142857 | 0.017857 | 0.910714 | 54.285714 | 0.839286 | 3451.839286 | 2.339286 | 722.214286 | 22.357143 | 2.553571 | 30.321429 |
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (10, 10))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['K_means_segments_5'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_5'], axis =1, inplace = True)
data.drop(['K_means_segments_5'], axis =1, inplace = True)
data_model.drop(['K_means_segments_5'], axis =1, inplace = True)
from sklearn_extra.cluster import KMedoids # To perform K-Medoids clustering

# Initializing K-Medoids with number of clusters as 3 and random_state=1
kmedo = KMedoids(n_clusters=3, random_state=1)
# Fit and predict K-Medoids using data_pca
kmedo.fit(data_pca)
preds = kmedo.predict(data_pca)
# Calculate the silhouette score
score = silhouette_score(data_pca, preds)
print(score) # Print the score
0.1988922758795572
# Adding the K-Medoids cluster labels (predicted on data_pca) to the whole data
data['kmedoLabels'] = preds
# Adding the K-Medoids cluster labels to data_model
data_model['kmedoLabels'] = preds
# Adding the K-Medoids cluster labels to data_pca
data_pca['kmedoLabels'] = preds
# Let's check the distribution
data["kmedoLabels"].value_counts()
2 662 1 627 0 391 Name: kmedoLabels, dtype: int64
PCA_PLOT(0, 1, data_pca, "kmedoLabels")
cluster_profile_kmedoLabels = data.groupby("kmedoLabels").mean()
cluster_profile_kmedoLabels.style.highlight_max(color="lightgreen", axis=0)
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | kids | ageing_customer | Family_Size | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kmedoLabels | ||||||||||||||||||||||||||||||||
| 0 | 1.501279 | 0.667519 | 39639.858720 | 0.731458 | 0.562660 | 74.212276 | 59.690537 | 5.360614 | 29.703325 | 7.437340 | 5.808184 | 17.020460 | 2.166240 | 2.452685 | 0.713555 | 3.593350 | 5.831202 | 0.061381 | 0.028133 | 0.000000 | 0.005115 | 0.000000 | 0.000000 | 0.020460 | 52.554987 | 1.294118 | 3234.598465 | 2.961637 | 125.020460 | 8.925831 | 0.115090 | 11.941176 |
| 1 | 1.349282 | 0.642743 | 31165.359261 | 0.811802 | 0.414673 | 35.393939 | 29.599681 | 4.443381 | 17.784689 | 6.047847 | 4.287081 | 13.181818 | 2.019139 | 1.923445 | 0.414673 | 2.990431 | 6.791069 | 0.078150 | 0.004785 | 0.000000 | 0.000000 | 0.003190 | 0.017544 | 0.133971 | 49.639553 | 1.226475 | 3400.448166 | 2.869219 | 75.344498 | 7.347687 | 0.220096 | 8.829346 |
| 2 | 1.487915 | 0.660121 | 59015.342537 | 0.253776 | 0.765861 | 45.577039 | 369.761329 | 30.848943 | 148.250755 | 42.471299 | 31.992447 | 63.930514 | 3.350453 | 6.379154 | 3.554381 | 8.154079 | 5.253776 | 0.067976 | 0.090634 | 0.025680 | 0.039275 | 0.012085 | 0.010574 | 0.120846 | 55.797583 | 1.019637 | 3408.261329 | 2.679758 | 687.255287 | 21.438066 | 0.356495 | 30.690332 |
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (10, 10))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['kmedoLabels'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['kmedoLabels'], axis =1, inplace = True)
data.drop(['kmedoLabels'], axis =1, inplace = True)
data_model.drop(['kmedoLabels'], axis =1, inplace = True)
Let's find the cophenetic correlation for different distance metrics with different linkage methods.
The cophenetic correlation coefficient is the correlation between the cophenetic distances (dendrogram distances) obtained from the tree and the original distances used to construct the tree. It measures how faithfully a dendrogram preserves the pairwise distances between the original data points.
The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined, i.e., the distance between the two subclusters merged by that link.
The cophenetic correlation is also a way to compare two or more dendrograms.
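As a small illustration of the definition above (on toy data, not the project dataset), the cophenetic correlation for one distance/linkage pair can be computed like this:

```python
# Toy illustration: cophenetic correlation between the original pairwise
# distances and the dendrogram (cophenetic) distances for one linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
toy = rng.normal(size=(20, 3))  # 20 random points in 3 dimensions

orig_dists = pdist(toy)  # pairwise distances in the original space
Z = linkage(toy, metric="euclidean", method="average")  # hierarchical tree

# c is the correlation between orig_dists and the merge heights at which
# each pair of points is first joined in the dendrogram
c, coph_dists = cophenet(Z, orig_dists)
print(round(c, 3))  # closer to 1 => the dendrogram preserves the distances better
```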
Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average"]
from scipy.cluster.hierarchy import linkage, cophenet # To perform hierarchical clustering and compute the cophenetic correlation
from scipy.spatial.distance import pdist # To compute pairwise distances

high_cophenet_corr = 0 # Variable to track the highest cophenetic correlation seen so far
high_dm_lm = [0, 0] # List to store the corresponding distance metric and linkage method
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm) # Applying each linkage method with each distance metric on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca)) # Calculating the cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c: # Checking if the cophenetic correlation is higher than the previous best
            high_cophenet_corr = c # Updating the highest cophenetic correlation seen so far
            high_dm_lm[0] = dm # Storing the corresponding distance metric
            high_dm_lm[1] = lm # Storing the corresponding linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.4519580951404056.
Cophenetic correlation for Euclidean distance and complete linkage is 0.6618285441876266.
Cophenetic correlation for Euclidean distance and average linkage is 0.7456986470939312.
Cophenetic correlation for Chebyshev distance and single linkage is 0.461008202489796.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6471878469387611.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7184644061096592.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.5049187196205653.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.4893803759051994.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.6175117901276773.
Cophenetic correlation for Cityblock distance and single linkage is 0.5318044327779199.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7105065386093503.
Cophenetic correlation for Cityblock distance and average linkage is 0.7254498869142997.
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.7456986470939312, which is obtained with Euclidean distance and average linkage.
Let's have a look at the dendrograms for different linkages with Cityblock distance
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
from scipy.cluster.hierarchy import dendrogram # To plot dendrograms

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="cityblock", method=method) # Hierarchical clustering with cityblock distance and the given linkage
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of the dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding the cophenetic correlation for different linkages with cityblock distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations and Insights:
Think about it:
Let's have a look at the dendrograms for different linkages with Chebyshev distance
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="chebyshev", method=method) # Hierarchical clustering with Chebyshev distance and the given linkage
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of the dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding the cophenetic correlation for different linkages with Chebyshev distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Let's have a look at the dendrograms for different linkages with Mahalanobis distance
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="mahalanobis", method=method) # Hierarchical clustering with Mahalanobis distance and the given linkage
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of the dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding the cophenetic correlation for different linkages with Mahalanobis distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Let's have a look at the dendrograms for different linkages with Euclidean distance
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method) # Hierarchical clustering with Euclidean distance and the given linkage
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of the dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding the cophenetic correlation for different linkages with Euclidean distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
from sklearn.cluster import AgglomerativeClustering # To perform Agglomerative (hierarchical) Clustering

# Initialize Agglomerative Clustering with affinity (distance) as Euclidean and linkage as 'ward' with n_clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
# Fit on data_pca
HCmodel.fit(data_pca)
AgglomerativeClustering(n_clusters=3)
# Add Agglomerative Clustering cluster labels to data_pca
data_pca['HCLabels'] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to the whole data
data['HCLabels'] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to data_model
data_model['HCLabels'] = HCmodel.labels_
# Let's check the distribution
data_pca.HCLabels.value_counts()
1 786 0 463 2 431 Name: HCLabels, dtype: int64
PCA_PLOT(0, 1, data_pca, "HCLabels")
cluster_profile_HCLabels_3 = data.groupby("HCLabels").mean()
cluster_profile_HCLabels_3.style.highlight_max(color="lightgreen", axis=0)
| Recency | ageing_customer | Expenses | NumTotalPurchases | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|
| HCLabels | ||||||
| 0 | 47.278618 | 3404.034557 | 812.701944 | 23.548596 | 0.401728 | 33.764579 |
| 1 | 59.708651 | 3279.685751 | 125.041985 | 8.863868 | 0.045802 | 11.178117 |
| 2 | 29.141531 | 3474.470998 | 177.547564 | 10.252900 | 0.457077 | 14.160093 |
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (10, 10))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['HCLabels'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['HCLabels'], axis =1, inplace = True)
data.drop(['HCLabels'], axis =1, inplace = True)
data_model.drop(['HCLabels'], axis =1, inplace = True)
DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge is determining the best set of hyperparameters. It has two key hyperparameters: eps and min_samples.
Since it is an unsupervised algorithm, there is no validation set to tune against, unlike in supervised learning. The approach we can follow is to try out a number of different hyperparameter combinations and compute the silhouette score for each of them.
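That trial-and-error search can be sketched on synthetic data. The `make_blobs` dataset and the candidate values below are purely illustrative; the notebook runs the same idea on `data_pca`.

```python
# Illustrative grid search over DBSCAN hyperparameters, scored by silhouette.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

best = (None, None, -1.0)  # (eps, min_samples, silhouette score)
for eps in (0.5, 1.0, 2.0):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # silhouette_score needs at least 2 distinct labels; skip degenerate fits
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(X, labels)
        if score > best[2]:
            best = (eps, min_samples, score)

print("best eps={}, min_samples={}, silhouette={:.3f}".format(*best))
```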
from sklearn.cluster import DBSCAN # To perform DBSCAN clustering

# Initializing lists of candidate hyperparameter values
eps_value = [2, 3] # Candidate eps values
min_sample_values = [6, 20] # Candidate min_samples values
# Creating a dictionary mapping each eps value to the list of min_samples values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0 # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0] # Assigning 0's to the high_i_j list
key = res.keys() # Assigning dictionary keys to a variable called key
for i in key:
z = res[i] # Assigning dictionary values of each i to z
for j in z:
db = DBSCAN(eps=i, min_samples=j).fit(data_pca) # Applying DBSCAN to each of the combination in dictionary
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
silhouette_avg = silhouette_score(data_pca, labels) # Finding silhouette score
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette score is :",
            silhouette_avg, # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg: # If the score beats the previous best, store it along with its combination of eps and min_samples
            high_silhouette_avg = silhouette_avg # Updating the best score seen so far
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette score is : 0.5050254980203432
For eps value =2 For min sample =20 The average silhouette score is : 0.47408529133829885
For eps value =3 For min sample =6 The average silhouette score is : 0.48036844597375694
For eps value =3 For min sample =20 The average silhouette score is : 0.5805508949885789
# Printing the highest silhouette score
print("Highest_silhoutte_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest_silhoutte_avg is 0 for eps = 3 and min sample = 20
# Apply DBSCAN using the above hyperparameter values and fit_predict on data_pca
dbs = DBSCAN(eps=3, min_samples=20)
db_labels = dbs.fit_predict(data_pca)
# Add DBSCAN cluster labels to the whole data
data['DBSLabels'] = db_labels
# Add DBSCAN cluster labels to data_model
data_model['DBSLabels'] = db_labels
# Add DBSCAN cluster labels to data_pca
data_pca['DBSLabels'] = db_labels
# Let's check the distribution
data_pca["DBSLabels"].value_counts()
0 1674 -1 6 Name: DBSLabels, dtype: int64
PCA_PLOT(0, 1, data_pca, "DBSLabels")
data.drop(['DBSLabels'], axis =1, inplace = True)
data_model.drop(['DBSLabels'], axis =1, inplace = True)
data_pca.drop(['DBSLabels'], axis =1, inplace = True)
from sklearn.mixture import GaussianMixture # To perform Gaussian Mixture Model clustering

gmm = GaussianMixture(n_components=5, random_state=1) # Initialize the Gaussian Mixture Model with 5 components and random_state=1
# Fit and predict the Gaussian Mixture Model using data_pca
gmm.fit(data_pca)
preds = gmm.predict(data_pca)
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.1936138024032675
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to the whole data
data['GmmLabels'] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_model
data_model['GmmLabels'] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_pca
data_pca['GmmLabels'] = preds
# Let's check the distribution
data_model["GmmLabels"].value_counts()
3 704 0 664 2 125 4 116 1 71 Name: GmmLabels, dtype: int64
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (10, 10))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['GmmLabels'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()